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Abstract 

Adaptation to local environments often occurs through natural selection acting on a large num- 
ber of alleles, each having a weak phenotypic effect. One way to detect these alleles is to identify 
genetic polymorphisms that exhibit high correlation with environmental variables used as proxies 
for ecological pressures. Here we propose an integrated framework based on population genetics, 
ecological modeling and statistical learning techniques to screen genomes for signatures of local 
adaptation. These new algorithms introduce latent factor mixed models to population genetics, 
employing an approach based on probabilistic principal component analysis in which population 
structure is introduced via unobserved variables. These fast, computationally efficient algorithms 
detect correlations between environmental and genetic variation while simultaneously inferring 
background levels of population structure. Comparing these new algorithms with related methods 
provides evidence that latent factor models can efficiently estimate random effects due to popula- 
tion history and isolation-by-distance patterns when computing gene-environment correlations, and 
decrease the number of false-positive associations in genome scans. We then apply these models 
to plant and human genetic data, identifying several genes with functions related to development 
that exhibit unusual correlations with climatic gradients. 



1 Introduction 



Local adaptation through natural selection plays a central role in shaping the variation of natural 
populations [l|[2j| and is of fundamental importance in evolutionary, conservation, and global-change 
biology [3||7]. The intensity of natural selection commonly varies in space, and can result in gene- 
environment interactions that have measurable effects on fitness (e.g. [s]). Adaptive divergence can 
then cause local populations to evolve traits that provide advantage under local environmental condi- 
tions. 

In principle, identifying chromosomal regions involved in adaptive divergence can be achieved by 
scanning genome-wide patterns of DNA polymorphism 9p!0 . Usually, the aim of screening procedures 
is to detect locus specific signatures of positive selection. In populations inhabiting spatially distinct 
environments, loci that underlie adaptive divergence can be detected by comparing relative levels 
of differentiation among large samples of unlinked markers [IT 12 and by using empirical tests to 



compare levels of differentiation to the genomic background 113 , 14 



An alternative way to investigate signatures of local adaptation, especially when beneficial alleles 
have weak phenotypic effects, is by identifying polymorphisms that exhibit high correlation with 
environmental variables [3,15-18]. In natural populations, quantitative traits that exhibit continuous 
geographic variation are often associated with specific variables reflecting selective pressures acting on 



individual phenotypes 19 . This type of variation is then reflected in geographic clines or in sympatric 



populations that exploit different ecological niches 20 . Evidence for local adaptation to continuous 
environments can be detected if there is highly significant association with the environmental variables 
at some loci compared to the background genomic variation. 

However the geographical basis of both environmental and genetic variation can confound inter- 
pretation of the tests [21], as local adaptation can be hindered by gene flow |22|, and can be difficult 



to distinguish from the effects of genetic drift and demographic history 14 . As a consequence, when 
no corrections for the effects of population structure or isolation-by-distance are considered, tests for 
associations between loci and environmental variables using classical regression models are prone to 



high rates of false-positives 23 . Recent studies have used the background patterns of allele frequencies 



to build a null model which accounts for the effects of drift and demographic history 115 18 24 ,25 . To 



correct for population stratification, [151 used an empirical approach that estimates the covariance of 
allele frequencies among populations. These authors assessed the evidence for local adaptation of each 
allele by testing whether or not environmental variables explained more variance than a null model 
with this particular covariance structure. 

Here we argue that, unless a list of a priori selectively neutral loci are used to build the empirical 
covariance matrix, empirical tests may face a problem of circularity. The need to identify neutral loci 
from the genomic background before testing implies that these tests lack power to reject neutrality. 
In this study, we address this problem by introducing new statistical models, called latent factor 
mixed models. Using these models, we test correlations between environmental and genetic variation 
while estimating the effects of hidden factors that represent background residual levels of population 
structure. To perform parameter estimation, we extend probabilistic principal component analysis 



and recent statistical learning approaches 26-28 . Based on low rank approximation of the residual 
covariance matrix, we implement algorithms to deal with hundreds of thousands of polymorphisms 
with very rapid computing times. We show that our algorithms control for random effects due to 
population history and spatial autocorrelation when estimating gene-environment association, and we 
provide examples of how our approach can be used to detect local adaptation in plants and humans. 

2 Method 

Consider the data matrix, (Gu), where each entry records the allele frequency in population or indi- 
vidual i at the genomic locus £, 1 < i < n, 1 < i < L, and n and L represent the total sample size 
and number of loci, respectively. For simplicity, we assume our loci are biallelic, e.g. single nucleotide 
polymorphisms (SNPs), and data are avaiable for each individual. In this case, for each marker, there 
is an ancestral and a derived allele, and Ga is the number of derived alleles for locus £ and individual i. 
For diploid data, Ga is thus equal to 0, 1 or 2, and corresponds to the genotype at locus £. In addition 
to the genotypic data, we have a vector of d geographic and environmental variables, (Xi), for each 
individual. The vector of covariates could include latitude and longitude, habitat and other ecological 
information, climatic variables, etc, that serve as proxies for unknown environmental pressures (For 
example, see (l5|[2T]). 



Model. To evaluate associations between allele frequencies and environmental variables while 
correcting for background levels of population structure, we regard the matrix G as being a response 
variable in a regression mixed model 

G^i = //, + l3jX^ + U^Ve + e,i , (1) 

where is a locus specific effect, is a d-dimensional vector of regression coefficients, Ui and 
are scalar vectors with K dimensions (1 < < n). The residuals ej£ are statistically independent 
Gaussian variables of mean zero and variance cr^. We use Bayesian analysis to estimate the regression 
coefficients and their standard deviations. We assume Gaussian prior distributions on /i^ and /3^j with 
means equal to zero and variances cr^ and o"^, {fiij ~ N(0, cr^,)). Prior distributions on Ui and 
are Gaussian distributions with means equal to zero and constant variance for each component (the 
components are independent random variables). The variance of is set to ay = 1, and all other 
prior distributions on variances are non-informative. We refer to the above statistical model as a 
Latent Factor Mixed Model (LFMM). 

In LFMMs, environmental variables are introduced as fixed effects while population structure is 
modeled via latent factors. To separate neutral variation from adaptive variation, the matrix term 
U'^V models the part of genetic variation that cannot be explained by the environmental pressures. 
Note that the use of factorization methods is closely related to estimating population structure via 
singular value decomposition, a well-established technique for identifying scores and loadings in prin- 
cipal component analysis (PGA, ^29j). Recently, matrix factorization methods have been generalized 
to include probabilistic PGA [26] and probabilistic matrix factorization algorithms f27|, which have 
proven useful in analyzing population genetic data [28] . To clarify the connection between LFMM and 
PGA, assume that no environmental variable is available. In this case, we set Pi = for all locus i. 
In matrix factorization algorithms, a data matrix G with n rows and L columns can be decomposed 
into a product of two matrices U and V, where U has n rows and K columns, and y is a K-hy-L 



matrix. Following 30 , we assume that the genotypic data are centered. We consider the matrix 



Yii = Gii — Gj, where we have substracted the mean value of each column, G,£ = Y17=i Gu/n. For 



each individual i and locus i, the decomposition is 

K 

Yie = UlVe = ^U,kVM. (2) 

k=l 

To estimate the factor vectors Ui and Vi, the squared error is minimized on the set of observed data 

K 
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{Yu - UikVkif ■ (3) 



With K = L, this approach is similar to computing PCA loadings and scores 29 . The number 
of components K can, however, be chosen much lower than the number of loci or individuals. In 
simulations, we based our choice of K on Tracy- Widom theory |30|. In real applications, this choice 
of K may be replaced by other estimates of population genetic structure. When values are lower than 
50, our algorithm is essentially a low-rank approximation of the covariance structure |31j , which leads 
to computationally fast estimation algorithms. 

To simultaneously estimate scores and loadings, environmmental effects and biases, we imple- 
mented a Gibbs sampler algorithm for LFMMs (File SI). The Gibbs sampler is based on computing 
products of matrices of low dimension (typically, K < 50), and its speed scales with the current size 
of SNP data sets, around n ~ 1, 000 and L ~ 500, 000. In addition, we implemented a stochastic algo- 
rithm to compute standard deviations and |2;|-scores for the environmental effects. Using the empirical 
distribution of |z|-scores obtained from all L loci, we compared each locus to the genomic background 
and retained loci with |2:|-scores exhibiting the highest absolute values. From a preliminary set of 
experiments using data simulated from the model defined in equation ([T]), we found that the esti- 
mates of fixed effects stabilized quickly, after 1,000 to 10,000 sweeps for n = 100 — 1,000 individuals 
and L = 1,000 — 100,000 loci. A 10-fold increase in the number of sweeps, however, was necessary 
to recover the true values of the latent factors. Additionally, we developed numerical optimization 
methods to compute maximum a posteriori (MAP) estimates for the LFMM. One of these methods, 
the alternate least square method uses deterministic steps that are similar to our stochastic Gibbs 
sampler |32j . When checking for convergence of the MCMC algorithm, we also found that least square 
estimates of regression coefficients were close to the point estimates computed by the Gibbs sampler 
method. 

Theoretical considerations. Incorporating population genetic structure via estimates of admixture 
proportions or principal component analysis is common in genome- wide association studies |33l|34|). 



[18] developed an alternative approach to identify loci underlying local adaptation in the computer 
program Bayenv. To explain the difference between our approach and Bayenv, suppose that we start 
by computing PCA scores from the matrix Y for all individuals, and denote by Ui the PCA scores for 
individual i. The product matrix UU^ is thus equal to the empirical covariance matrix 

UU^ = YY^/n . (4) 

Now using the scores as covariates in a Bayesian regression model, we obtain 

G = fi + p^X + U^V + e. (5) 

By a change of variables, this is equivalent to fitting the model 

G = n + + e (6) 

where the distribution of e is a multivariate Gaussian distribution of the covariance matrix equal to 
cj^Id + ayYY^ /n (Id is the n-dimensional identity matrix). Setting ay = 1 and considering small 
values of the scaling parameter cr^ , the model defined in equation Q is nearly equivalent to the model 
implemented in Bayenv. In a Bayesian Gaussian regression framework, incorporating PCA scores 
as covariates in an association model is equivalent to modeling residuals as Gaussian vectors with 
covariance depending on the empirical covariance matrix of the genotypic data. Though the Bayenv 
model uses a different link function, its residual term has the same covariance matrix as the genotypic 
data. From a theoretical point of view, the main difference between the Bayenv model and LFMM is 
the inference of the factor matrix U which is done in a fully Bayesian algorithm in LFMMs and in an 
empirical Bayes algorithm in Bayenv (see Discussion). 

3 Simulation Study 

We designed experiments based on simulated data to answer the following questions: 1) Are tests 
based on LFMMs conservative or liberal? 2) How does the LFMM algorithm perform compared to 
existing methods such as logistic or standard regression models [sl, principal components regression 



and other existing mixed models 18 ? 



Distribution of P-values under the null hypothesis. We used equation ([T]) with /3 = to 
generate data under a null hypothesis of no association with any environmental variables. In these 
experiments, we set the number of individuals to n = 100, and the number of loci to L = 1,000. 
We used 6 values, K = 1,3,5,7,10 and 20, for the rank of the factor matrix, V. For each series of 
experiments, we generated 10 replicates of this generative model, and we studied the distributions of 
P-values for tests using LFMMs. In these tests, we set the rank of the factor matrix equal to the 
values we used to generate simulations. 
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Figure 1: Empirical cumulative distribution function (ECDF) for LFMM tests for simulations from a 
generative model using A) K=5, B) K=20 latent factors. 

Figure 1 reports the empirical cumulative distribution function (ecdf) for K = 5 and K = 20. 
Plots for the other values of K are shown in Figure SI. For values of K lower than 5, the ecdf was 
close to a uniform distribution. For K = 20, the tests were slightly conservative. Thus, for moderate 
and for large values of the number of latent factors, the tests produced small numbers of false positive 
associations. 

Next we used equation ([T]) to generate data showing various levels of population structure and 
association with an environmental variable. The environmental variable was uniformly generated 
in the range (0,1). Here we used 3 values for the rank of the factor matrix, K = 2,20 and 100, 



representing low, moderate and high levels of underlying population genetic structure. For each series 
of experiments, we generated 20 replicates of the generative model and compared the distribution of 
statistical errors for three estimation approaches: 1) LFMM, 2) standard linear regression model, 3) 
PC regression model. With the notations from section 2, these models were defined as follows. The 
LFMM was defined by equation 

where we set the rank of the factor matrices equal to the values we used to generate simulations. The 
standard regression model was defined as 

Git = M£ + (ij^i + ^ii ■ (7) 
The PC regression model was defined as 

Git = m + fiJ^i + 'Ujyi^^it, (8) 

where (C/j) are the first K PCs computed from the matrix G. To compute point estimates of environ- 
mental effects and their jzj-scores, Gibbs sampler algorithms were run for 1,000 sweeps after a burnin 
period of 100 sweeps. For these particular run length parameters, we checked that similar estimates 
were obtained for distinct initializations of the algorithm. For each locus, we recorded both the true, 
Bi^ and estimated environmental effects, 5^, and evaluated the absolute error 
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Quantiles of absolute errors (LFMM) 

Figure 2: Quantiles of absolute errors for the standard linear regression, PC regression and LFM 
models using simulations from the LFM model with A) K=2, B) K=20 and C) K=100 latent factors. 

Figure 2 reports the quantiles of absolute errors for the LFMM, the standard linear regression and 
PC regression models. For the LFMM, absolute errors ranged between and 0.6 for K = 2 — 20, 
and between and 1.0 for K = 100. Mean squared errors indicated that the bias and variance 
of estimates were small (Table 1). Compared to LFMMs, the relative errors of the linear and PC 
regression estimates increased with the rank of the hidden factor matrix. The absolute errors of these 
algorithms ranged between and 1.4 for K = 2, between and 3.2 for K = 20, and between 
and 9.2 for K = 100. When linear or PC regression models were fitted to the data, the quantiles of 
errors shifted to values ~ 1.74- fold higher for X = 2, ~ 3.8 to 4.1-fold higher for K = 20, and ~ 5.5 
to 7.7-fold higher for K = 100. Mean squared errors provide additional evidence of relatively poor 
performances of the linear regression and PC regression estimates when levels of underlying structure 
increase (Table 1). 

Spatial coalescent simulations. In another series of experiments, we compared the LFMM esti- 
mation algorithm against two methods that do not correct for population stratification, and against 
two methods that use the empirical covariance matrix as a correction. The first set of methods in- 
clude a linear model and generalized linear model (LM and GLM, js]), and the second set of methods 
include a PC regression model (PCRM) and the mixed model Bayenv |18|. To enable comparisons, we 



simulated genotypic data from spatial coalescent models with the computer program ms |35| . Ten data 
sets were generated according to a linear stepping-stone model with 40 demes, setting the effective 
migration rate between pairs of adjacent demes to the value ANm = 25. Sampling 5 individuals in 
each deme, each data set included a total of n = 200 haploid individuals genotyped at L = 1,000 
SNP loci. Using Tracy- Widom tests implemented in SmartPCA, we found that the number of principal 
components with P-values smaller than 0.01 was around Xtw = 7. We ran the Gibbs sampler algo- 
rithm during 100 sweeps for burnin, and we used the next 900 sweeps to compute points estimates, 
variances and -scores. 

Distribution of P- values. To examine the outcome of tests when genetic variation is neutral at all 
loci, we computed the distributions of P-values under a LM, GLM, PCRM and LFMM with different 
values for the number of latent factors {K ranging from 1 to 20). The distributions of P- values for 
tests based on LM and GLM showed a strong departure from the uniform distribution (Figure 3A-B). 
In those cases, the tests were too liberal, and produced a large number of false positive results. Using 
K = 7 latent factors or PCs, the distribution of P-values for tests based on an LFMM or PCRM was 
much closer to a uniform distribution (Figure 3C-D). In addition, we found that choosing K based on 
Tracy- Widom theory led to slightly conservative tests for the particular simulation settings used here. 
Ecdf for all values of K are shown in Figures S2 and S3, respectively. 
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Figure 3: Empirical cumulative distribution function (ECDF) for A) the linear regression model (LM), 
B) the generalized linear model (GLM), C) the LFM model using K = 7 latent factors (LFMM), and 
D) the PC regression model using K = 7 principal components (PCRM). 



Next we evaluated the ability of LFMMs to detect loci exhibiting correlations with particular 
environmental gradients, and compared tests based on an LFMM with methods based on linear models 
and the computer program Bayenv [iS]. An environmental variable, x, was defined for each population 
as the geographic identifier of the population in the linear stepping-stone model. We created an 
environmental gradient using a logistic function, s{x), of x as follows 

For each of the 10 previously generated neutral stepping-stone simulations, we simulated binary alleles 
for each deme x at 50 loci with frequency s{x), with the slope of the gradient 9 = 0.2. We then obtained 
10 data sets with L = 1050 loci including 50 loci correlated with the environmental gradient, s{x). 
Across these datasets, smartPCA estimates between 5 and 7 significant eigenvectors {P < 0.01 in the 



Tracy- Widom test). For the simulated data sets, we evaluated the percentage of false negative (FN) 
and of false positive (FP) tests based on LM, GLM, PCRM and LFMM, for two values of the type I 
error (Table 2). 

Using 9 = 0.2 in simulations of non-neutral loci, we found that linear models exhibited high 
percentages of FP. In contrast, tests based on PCRMs exhibited very large percentages of FN, and 
had no power to reject neutrality. Tests based on LFMM produced low numbers of FP, and had 
reasonable power to reject the null hypothesis of no association. 

Comparisons with Bayenv. For each of the 10 previously generated neutral stepping-stone sim- 
ulations, we simulated binary alleles for each deme x at 50 loci with frequency s{x), using 9 = 0.1. 
To enable comparision with the program Bayenv, which returns Bayes factors instead of P-values, we 
considered ranked lists recording the M loci corresponding to the strongest (true) associations (M 
between 1 and 1,050). For each M, we computed the number of true positives and the number of 
false negatives. Locus ranking was performed on the basis of |z| -scores in LFMM, and on the basis of 
Bayes factors in Bayenv. The LFMM tests used values of K equal to K = 1, 3, 5, 7, 10 and 20, and we 
used the default parameters of the Bayenv algorithm to compute Bayes factors (run length of 30,000 
sweeps). Experiments were assessed by measuring the area under the receiver-operating characteristic 
curve (AUG) averaged over 10 replicates. The mean AUG for tests based on LFMM with K = 5 — 7 
factors were around 0.95 — 0.96 whereas the AUG for Bayenv was equal to 0.88. In the linear stepping 
stone model simulations, the tests based on LFMM obtained better performances than Bayenv for all 
values of K (Figure 4). 
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Figure 4: Detection of outlier loci. Number of true positives for the Bayenv model and the Latent 
Factor Mixed Model for K = 7 (Tracy-Widom value) and for K = 1,3, 5, 10, 20 for spatial coalescent 
simulations including loci under selection. 



4 Data Analysis 

To illustrate the application of LFMMs, we analyzed genomic data sets of Loblolly Pines [Pinus taeda, 
Pinaceae, [2l]) and humans (Human Genome Diversity Panel, [36] ) . 

Loblolly Pine 

The Loblolly Pine is distributed throughout the southeastern USA, ranging from the arid Great Plains 
to the humid Eastern Temperate Forest ecoregion. The data consisted of 1,730 SNPs selected in 



expressed sequence tags (EST) for 682 individuals [21] . Following 21 , we considered 5 environmental 
variables representing the five first components of a PCA for 60 climatic variables. The first component 
(PCI) was mainly described by latitude, longitude, temperature and winter aridity. PC2 was described 
by longitude, spring-fall aridity and precipitation f21|. 

For each of the 5 environmental variables, we applied the LFMM algorithm using 100 sweeps 
for burnin and 400 additional sweeps to compute |z|-scores for all loci. Based on a prior analysis 
of the genotypic data with the program SmartPCA, we used K = 10 latent factors. A total of 392, 



113 and 30 SNPs obtained |2;|-scores greater than 3, 4 or 5 for at least one environmental variable, 
respectively. Based on this result, we considered that a SNP effect was significant when its |z|-score 
was greater than 4 (two-sided test). Among the 50 loci with the highest |2;|-scores, 17 were shared with 
those detected by [2l| using Bayenv. Seven of the 10 SNPs with Bayes factors greater than 10^ were 
confirmed by the LFMM analysis. For the first and second environmental variables, the two SNPs 
which obtained the highest Bayes factors using Bayenv were recovered by the LFMM analysis. Table 3 
provides a list of SNPs associated with climatic gradients and their functional annotation. Compared 



to the analysis of 21 , the LFMM analysis discovered new significant and interesting associations 
with climatic gradients, for example, the chloroplast lumen 19 kDA protein involved in photosynthesis 
(l^l = 6.42), a pentatricopeptide repeat protein involved in oxidative stress and salt stress {\z\ = 5.90), 
and the heat shock transcription factor hsf5 {\z\ = 5.60) involved in regulation of transcription and 
response to temperature stress (Table 3 and Table SI). 

Human data analysis 

We applied an LFMM analysis to a worldwide sample of genomic DNA from 1,043 individuals in 
52 populations, referred to as the Human Genome Diversity Project - Centre Etude Polymorphism 
Humain (HGDP-CEPH) Human Genome Diversity Cell Line Panel (hagsc.org/hgdp/). The genotypes 
were generated on Illumina 650K arrays [36j, and the data were filtered to remove low quality SNPs 
included in the original files. 

We extracted climatic data for each of the 52 population samples using the WorldClim data set 



at 30 arcsecond (Ikm^) resolution 37 . These data include 11 bioclimatic variables interpolated from 
global weather station data collected during a 50 year period (1950-2000). The climatic variables 
included annual mean temperature, mean diurnal range, maximum temperature of warmest month, 
minimum temperature of coldest month, annual precipitation, etc (Table S2). We summarized the 
climatic variables by using the first axis of a principal component analysis. For this first principal 
component, we applied the LFMM algorithm to compute |2:|-scores for each locus with = 50 latent 
factors, using 100 sweeps for burnin and 900 additional sweeps to warrant convergence. 

A total of 2,624 (0.4%), 508 (0.08%) and 65 (0.007%) SNPs obtained |z|-scores greater than 5, 6 
or 7, respectively (Figure S3). Among loci with |z|-scores greater than 5, 28 GWAS-SNPs with known 



disease or trait association were found [381 ). These include several SNPs discovered by 24 . For 
example the SNPs rsl2913832 and rs28777, |2;|-scores greater than 6, are associated with genes 0CA2 
and SLC45A2 (Table 4). Among the SNPs significantly correlated with climatic gradients, several 
notable examples include genes associated with celiac disease (ICOSLG), height (LHX3-QSOX2 and 
IGFl), and vitamin D synthesis or activation (NADSYNl encoding nicotinamide adenine dinucleotide 
synthetase and DHCR7 the gene encoding 7-dehydrocholesterol reductase, an enzyme catalyzing the 
production in skin of cholesterol from 7-dehydrocholesterol (Table 4). Among the 65 SNPs with \z\- 
scores greater than 7, 31 were located within genes (Table S3). One interesting result is that several 
genes are associated with multicellular organismal development. EPHB4 {\z\ = 8.90) is involved in 
heart morphogenesis and angiogenesis, NRGl {\z\ = 7.15) involved with nervous system development 
and cell proliferation, RBM19 {\z\ = 7.04) involved with positive regulation of embryonic development, 
EYA2 {\z\ = 7.09) involved with eye development and DNA repair, and POLAl {\z\ = 7.63) involved 



with the mitotic cell cycle and cell proliferation 39 , 40] . 

5 Discussion 

Interpretation of LFMM results and other methods. Based on a matrix factorization ap- 
proach, LFMMs incorporate a unified framework for estimating effects of environmental and demo- 
graphic factors on genetic variation. Without environmental variables, LFMMs are equivalent to 
performing a probabilistic PCA of allele frequencies [26j. When environmental variables are included, 
hidden factors capture the part of genetic variation that cannot be explained by the set of measured 
environmental variables. This fraction of genetic variation could result from the demographic history 
of the species, unknown environmental pressures or from IBD patterns. 

While a plethora of statistical tests have been proposed for detecting genes evolving under positive 
selection and local adaptation |10^|14j , the development of tests based on correlations with habitat or 
landscape variables is still recent (3|[l5]. Compared to methods based on summary statistics, tests 
based on environmental association have increased power to detect selection from standing genetic 
variation and soft sweeps in a species genome [6|[T7]. However, simple implementation of these tests, 
for example simple linear or logistic regression models, can be misleading in the presence of IBD 



patterns [23| . Our simulation results provide clear evidence that tests based on LFMMs significantly 
reduce the rates of false positive associations in the presence of IBD. 

While both the mixed model approach of the computer program Bayenv and the LFMM approach 
includes a covariance structure in a regression model, there are important differences between the two 
approaches. A first improvement is that LFMMs estimate latent factors and regression coefficients 
simultaneously, while Bayenv first estimates a covariance matrix, and then uses it when estimating 
(random) environmental effects. To apply Bayenv, the authors suggest utilizing selectively neutral 
SNPs to estimate the covariance matrix. This approach requires separating neutral from adaptive 
variation a priori, and is difficult to apply when selection acts on phenotype at a large number of loci. 
Inclusion of adaptive markers in the "neutral set" is sometimes unavoidable, and in this case, Bayenv 
may overlook interesting associations. This distinction between approaches explains the observed 
differences in the lists of outlier loci for Loblolly pines, where 1,730 SNPs were geno typed in expressed 
sequences. For these data it was extremely difficult to select neutral SNPs from the background a priori. 
Another distinction between LFM and Bayenv approaches is our use of low rank approximations of the 
covariance matrix. LFMMs actually estimate correlations between environmental predictors and allele 
frequencies while K hidden factors explain residual genetic variation, where K is much smaller than 
the sample size. The low rank approximation is computationally faster than Bayenv when analyzing 
large data sets. 

Number of latent factors. In the LFM modeling approach, the choice of low values for K is 
important for optimizing the computational performances of the estimation algorithm. This choice is 
reminiscent of selecting the number of components in PCA or in Bayesian clustering programs, and it 
has also an impact on test outcomes. For values of K taken too large, the tests are highly conservative, 
and the power to reject neutrality declines. Values of K that minimize the trade-off between the bias 
and variance for our statistical estimates could be obtained by using cross-validation procedures, but 
cross-validation procedures are computationally intensive, so instead we use Tracy- Widom theory to 
select K ^Oj. We evaluated this choice during our simulation analysis, and found that it led to slightly 
conservative tests. Although the choice of Tracy- Widom estimates is suboptimal, the performances 
of LFMMs were still superior to those of Bayenv in simulations of IBD patterns. In the analysis of 



human data, we restricted K to be less than 50 (approximately the number of population samples). 
We suggest that, when there is a reasonable estimate of the number of genetic cluster for a species, it 



can be used in LFMM tests directly. While finer grain population structure could be evaluated 41 
our choice was again motivated by a trade-off between accuracy and run-time. A future development 
of our LFMM approach will be to develop fast numerical optimization procedures based on variational 
approximations of the likelihood, which will allow us to implement cross-validation algorithms and 
increase the power of tests. 



Plant and human data. For Pinus taeda, the LFMM results confirmed that several ESTs previ- 
ously discovered with Bayenv had functions linked to climate [2l]. In addition, the LFMM analysis 
discovered new interesting candidate SNPs. Those variants include functions associated with wound 
repair and immunity, photosynthetic activity and carotenoid biosynthesis, cellular respiration and car- 
bohydrate metabolism, heat, salt and oxidative stress responses (Table 3). Applying LFMMs to the 
HGDP data, we found that a total of 0.4% of all polymorphisms (2,624 SNPs) exhibited significant 
associations with temperature gradients {\z\ > 5). For example, we identified SNPs associated with 
the gene 0CA2 that may be functionally linked to blue or brown eye color and the gene SLC45A2 that 



may be associated with skin pigmentation 24 . This list also contained SNPs associated with height 
and vitamin D synthesis and diseases such as gluten intolerance and Crohn's disease. Our list of genie 
SNPs with 1^1 -scores greater than 7 {\z\ > 7) was enriched in genes involved in organismal develop- 
ment. For example, the genes EPHB4, BOK, and NRGl —with functions related to heart and brain 
development— were associated with climatic gradients. Overall, the analysis confirmed that many loci 
are associated with climatic gradients or to correlated evolutionary pressures (for example, pathogenic 
environment). This result supports the hypothesis that soft sweeps may have been common in recent 
human evolution 1171. 



Conclusion. With ever increasing amounts of genetic data generated by high-throughput sequenc- 
ing technologies, population genetic methods have shifted from traditional statistical approaches to 
approaches that use statistical learning techniques. Estimates of ancestry and other population param- 



eters are commonly obtained from mixture models 42 -44 , principal component analyses |30], hidden 



Markov models f45| and factor analysis |28j. Our study contributes to the machine learning toolbox 
for population and landscape genomic analysis by implementing new gene-environment association 
tests based on matrix factorization methods. 

Software availability. Source codes and computer programs for fitting LFMMs are available from 
the authors web-sites. 
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Supplementary Text 2 



FIGURE SI: Empirical cumulative distribution function for LFMM tests for simu- lations from gen- 
erative models with K = 1,3, 5, 10, and 20 latent factors. 




FIGURE S2: Empirical cumulative distribution function for the LFMM using K = 1,3,5,7,10, and 
20 latent factors. 
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FIGURE S3: Empirical cumulative distribution function for the PC regression model using K = 
1, 3, 5, 7, 10, and 20 latent factors. 
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Gibbs Sampling algorithm for the LFMM 



5.1 Prior Distribution 

Let D is the number of environmental variables. li j is the indicator variable equal to if the data are missing 
and 1 otherwise. N(/x, S) is the normal distribution of mean /z and of covariance matrix S. r~^{a,b) is the 
inverse-gamma distribution of shape a and of rate b (ie of scale j). 
The prior distributions on the LFMM parameters are given by: 

for all ij Gi,i\Ui,Vi,pi,iie,c7'' r^n{Xipi + Hi + UfVe,ay*-' (10) 

for alii Ui\(jl^n{{),allK) (11) 

forain y^|4~N(0,4/K) (12) 

forain,d /3d,,|a^(^) ~N(0,a2(,)) (13) 

forain Mdf^^~N(0,c72) (14) 
ay = 1 and cr^ is updated at each iteration by the current residual variance. 

for all d cr|((i), cr^ and af^ follow an inverse-gamma distribution r~^(?7,77) where 77 = 10^. 

5.2 Conditional Distribution 

The LFM Model is a hierarchical model which conditional distributions can be described as follows: 

p{al\U, n) = r-\rj + ^ E U^U^ + ^) (15) 

i 

pi^^mlP, = + f ' ^ E ^le + ^) (16) 

pialM = T-\rj + ^, 1 + rj) (17) 

e 

p{Ui\G, V, P, ,i,al,a^) = N(/x^, A^"') (18) 

= Ik + a^~' E m'c/ = ^^~\^hr' E(<^'/ - ^i^^ - W)^^ (19) 

p{Ve\G, U, 13, n, aa) = Hf^i, A^^"') (20) 

A'y = ay' Ik + cr^^' E ^'^^i ^'v = ^^~\^v)~' EC^*.^ ' ' (21) 

i i 

p{Pi\G, U, V, ap{lf, ap{d)\ a^) = N(m^, A^^"') (22) 

where 

A'^ = diag{a0{lf-\ . . .,a^{df-') + a^-'Y,XjXi and 4 = ^^"'(A^)-! ^{G^^t - UfV, - ^it)Xf (23) 

i i 

piMG, U, V, /?, al, a') = f^if^,, Ay') (24) 

where 

K = ^r' + 4 = E(^''^ - ^^^^ - ^'^^) (25) 



where 



where 



5.3 Main algorithm 

niter is tlie number of iterations and burn is the number of iterations for burning. 



1. 


Initiahze model parameters 




U = Ok,n 
V = Ok,m 
P = Om,d 
M = Om,i 




2. 


For n = 1 . . . niter 










• input missing values at locus £ for individu i, 












(n-l) 




• update the residual variance 












^ = var{G - - X("- 






• sample the hyperparameters 




























• for each locus £, sample 
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• for each individu i, sample 
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2(n)-) 




• for each locus /, sample 








3. 


compute the parameters 




V = mean(y(''"™+^\ . . . , 
p = mean(/3(''"™+i\ . . . , 






Z = mean{l3'^^ 







